Outline

Questions

  • Suppose our goal is to develop decision rules for classifying wines by cultivar. Construct the decision rules using: (i) Classification Tree; (ii) LDA; (iii) QDA; (iv) Nearest Neighbor; and (v) SVM to classify the wines. Evaluate and compare all the obtained decision rules; which one would you recommend?
  • Suppose there are 6 wines from unknown cultivars, with the measurements of 13 constituents shown in “wine_test.txt”. Based on the “classification tree” obtained in Q1, perform the two stabilized procedures “bagging” and “boosting” to classify the 6 wines. Are the classifications of the 6 wines the same as those obtained from the five decision rules in Q1?
  • Excluding the first column (class-id) of the “wine.txt” data, perform a cluster analysis using (i) one hierarchical tree method (with the best selected linkage); (ii) one partitioning method; and (iii) self-organizing maps.
  • Do the clustering results in Q3 recover the original classification of the wines based on class-id?




Solutions

In my opinion, before constructing any model, the first and foremost step is exploratory data analysis (descriptive statistics).

Let’s start by understanding the wine data!

Exploratory Data Analysis

As we can see, in terms of overall tendency:

  • For the top 5 variables, the chemical measurements of the A-cultivars are relatively high
  • For the middle group of variables, the measurements of the B-cultivars are relatively low
  • For the bottom group, the measurements of the C-cultivars are relatively high

The following requires attention:

  • For mg, although the values of the B-cultivars tend to be low, there are some outliers higher than the values of the A-cultivars and C-cultivars!
  • We need to keep this in mind: there are outliers which may potentially influence the accuracy of the classification.

As a result of the EDA, something comes to mind:
even before building a classification model, we expect the classification performance to be decent, and some variables to contribute significantly to separating the classes.


Next, we move to the correlation between the chemical measurements and the cultivars.

Q1

In order to develop a favorable decision rule for classifying cultivars,
we need to compare the true error rates, estimated by leave-one-out cross-validation (LOOCV), across all decision rules.

First, we construct all the decision rules as follows; then we combine all the results for comparison.
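The LOOCV procedure itself is simple: hold out each wine in turn, fit the rule on the remaining observations, and count misclassifications. A minimal Python sketch (the analysis here was done in R; the nearest-centroid rule and toy data below are only hypothetical stand-ins for the five classifiers):

```python
def loo_error(X, y, fit, predict):
    """Leave-one-out estimate of the true error rate: hold out each
    observation, train on the rest, and count misclassifications."""
    wrong = 0
    for i in range(len(X)):
        model = fit(X[:i] + X[i+1:], y[:i] + y[i+1:])
        if predict(model, X[i]) != y[i]:
            wrong += 1
    return wrong / len(X)

# Hypothetical stand-in rule: classify to the nearest class centroid.
def fit_centroids(X, y):
    cents = {}
    for c in set(y):
        pts = [x for x, lab in zip(X, y) if lab == c]
        cents[c] = [sum(col) / len(pts) for col in zip(*pts)]
    return cents

def predict_centroid(cents, x):
    return min(cents, key=lambda c: sum((u - v) ** 2
                                        for u, v in zip(cents[c], x)))

# Two well-separated toy classes: every held-out point is classified
# correctly, so the LOO error estimate is 0.
X = [[0.0], [0.2], [0.4], [2.0], [2.2], [2.4]]
y = ["A", "A", "A", "B", "B", "B"]
print(loo_error(X, y, fit_centroids, predict_centroid))  # -> 0.0
```

The same loop, with the tree/LDA/QDA/NN/SVM fits substituted for the stand-in rule, is what produces the error rates compared below.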

Tree

The tree result shows that the tree does not necessarily need pruning, since it is not complex.
Moreover, its true error rate is 14%.
In the plot, the suggested cp values lie between 0.042 and 0.017, because these all fall below the horizontal line. Hence, we choose cp = 0.042, because the resulting tree is simpler than with cp = 0.017, and its true error rate is 15%.


Classification tree:
rpart(formula = class.id ~ ., data = wine.data, method = "class", 
    control = wine.control)

Variables actually used in tree construction:
[1] falvanoids hue        od         proline   

Root node error: 101/168 = 0.60119

n= 168 

  CP nsplit rel error xerror xstd
1  0      0         1      1    0
2  0      1         1      1    0
3  0      2         0      0    0
4  0      3         0      0    0
5  0      4         0      0    0


Classification tree:
rpart(formula = class.id ~ ., data = wine.data, method = "class", 
    control = wine.control)

Variables actually used in tree construction:
[1] falvanoids od         proline   

Root node error: 101/168 = 0.60119

n= 168 

  CP nsplit rel error xerror xstd
1  0      0         1      1    0
2  0      1         1      1    0
3  0      2         0      0    0
4  0      3         0      0    0

LDA

The true error rate of LDA is about 1.8% (3 of the 168 wines are misclassified in the LOO confusion matrix below).

   
     A  B  C
  A 55  0  0
  B  1 65  1
  C  0  1 45
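The error rate follows directly from this matrix: 3 of the 168 held-out wines fall off the diagonal. A quick Python check (the matrix values are copied from the R output above):

```python
# LOO confusion matrix for LDA, copied from the output above
# (rows = true class A/B/C, columns = predicted class).
conf = [
    [55, 0, 0],
    [1, 65, 1],
    [0, 1, 45],
]
total = sum(sum(row) for row in conf)        # 168 wines
correct = sum(conf[i][i] for i in range(3))  # 165 on the diagonal
error_rate = 1 - correct / total             # 3 / 168
print(round(error_rate, 4))  # -> 0.0179
```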

QDA

The true error rate of QDA is 0.6% (1 of 168):

   
     A  B  C
  A 55  0  0
  B  1 66  0
  C  0  0 46

NN

After repeatedly performing the NN classifier with different values of K, a favorable K would be 2 to 4 (K = 1 attains zero LOO error here, which looks too optimistic to trust).

   k.choice ture.error.rate
1         1            0.00
2         2            0.13
3         3            0.14
4         4            0.17
5         5            0.22
6         6            0.23
7         7            0.21
8         8            0.23
9         9            0.24
10       10            0.21
11       11            0.23
12       12            0.23
13       13            0.24
14       14            0.26
15       15            0.27
16       16            0.27
17       17            0.27
18       18            0.27
19       19            0.27
20       20            0.27
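The table above can be reproduced by wrapping a k-nearest-neighbor vote inside the same leave-one-out loop. A self-contained Python sketch on toy data (the real analysis used R; the data and k values here are illustrative only):

```python
from collections import Counter

def knn_predict(X_tr, y_tr, x, k):
    """Majority vote among the k training points nearest to x."""
    order = sorted(range(len(X_tr)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(X_tr[i], x)))
    return Counter(y_tr[i] for i in order[:k]).most_common(1)[0][0]

def loo_knn_error(X, y, k):
    """Leave-one-out error rate of the k-NN rule."""
    wrong = sum(knn_predict(X[:i] + X[i+1:], y[:i] + y[i+1:], X[i], k) != y[i]
                for i in range(len(X)))
    return wrong / len(X)

# Toy data with two classes of 3 points each; scan k as the table scans 1..20.
X = [[0.0], [0.3], [0.6], [3.0], [3.3], [3.6]]
y = ["A", "A", "A", "B", "B", "B"]
print(loo_knn_error(X, y, 1))  # -> 0.0
print(loo_knn_error(X, y, 3))  # -> 0.0
print(loo_knn_error(X, y, 5))  # -> 1.0 (k too large: the other class outvotes)
```

The rising error for large k mirrors the trend in the table: once k exceeds the class sizes, distant points of other classes dominate the vote.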

SVM

We select the parameters (cost, gamma), each over 0.1 to 1, under leave-one-out cross-validation, giving 100 results in total.
The best parameter sets all tie at gamma = 0.1 with cost from 0.4 to 1. We can conclude that the SVM achieves its best performance when gamma = 0.1 and cost is anywhere from 0.4 to 1.

    c gamma acc
1 0.4   0.1  98
2 0.5   0.1  98
3 0.6   0.1  98
4 0.7   0.1  98
5 0.8   0.1  98
6 0.9   0.1  98
7 1.0   0.1  98

Comparison of all classifiers

Overall, QDA and SVM attain the lowest true error rates.

Q2

Bagging & Boosting

We use bagging to improve the prediction from the pruned tree and to reduce the sampling error arising from the training data.
We find that the bagged predictions differ from those of the original pruned tree;
that is, bagging changes, and is expected to stabilize, the predictions.

As a result, the prediction results with bagging are slightly different from those without bagging.

    method result
1  Bagging BBBBCC
2 Boosting BBBCCA
3     Tree BBBBBC
4      LDA BBCBCB
5      QDA BBBBCB
6       NN CBBBCA
7      SVM ABBBBB
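As a sketch of the bagging procedure: draw B bootstrap resamples of the training set, fit the base rule on each, and classify a new wine by majority vote. The Python below uses a hypothetical nearest-centroid base rule in place of the pruned tree, on toy data:

```python
import random
from collections import Counter

def bagged_predict(X, y, x_new, fit, predict, B=25, seed=0):
    """Bagging: fit the base rule on B bootstrap resamples of the
    training data, then classify x_new by majority vote."""
    rng = random.Random(seed)
    n = len(X)
    votes = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        model = fit([X[i] for i in idx], [y[i] for i in idx])
        votes.append(predict(model, x_new))
    return Counter(votes).most_common(1)[0][0]

# Hypothetical nearest-centroid base rule standing in for the pruned tree.
def fit_centroids(X, y):
    return {c: [sum(col) / len(col) for col in
                zip(*[x for x, lab in zip(X, y) if lab == c])]
            for c in set(y)}

def predict_centroid(cents, x):
    return min(cents, key=lambda c: sum((u - v) ** 2
                                        for u, v in zip(cents[c], x)))

X = [[0.0], [0.3], [0.6], [3.0], [3.3], [3.6]]
y = ["A", "A", "A", "B", "B", "B"]
print(bagged_predict(X, y, [0.2], fit_centroids, predict_centroid))  # -> A
```

Boosting differs in that it reweights the training observations sequentially instead of resampling them independently, but the final classifier is likewise a vote over base rules.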

Q3 & Q4

Hierarchical tree method with ward linkage

As a result, we can clearly identify three groups in wine.data, because the banner plot below can obviously be divided into three parts.

To examine the result of the hierarchical tree method with Ward linkage,
we check: the agglomerative coefficient (AC), and whether the resulting groups agree with the original class.id

  • The AC below indicates that there is strong structure in the wine data
[1] 0.94
  • The resulting groups largely agree with the original class.id (only a few positions differ)
  [1] A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A B A A
 [36] A B A A A A A A A A A A A A A A A A A A A A B B B B B B B B B B B B B
 [71] B B B C C C B B B B B B B B B B B B B B B B B B B B B B B B B B B B B
[106] B B B B B B B B B B B B B B B B B B B C C C C C C C C C C C C C C B C
[141] C C C C C C C C C C C C C C C C C C C C C C C C C C C C
Levels: A B C
  • Outlier test
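For intuition, the agglomerative idea behind the hierarchical tree can be sketched as follows: start with every point as its own cluster and repeatedly merge the closest pair. Ward linkage (used above) merges the pair that least increases the within-cluster sum of squares; for brevity this Python sketch uses average linkage on toy data:

```python
def agglomerative(points, k, linkage="average"):
    """Bottom-up clustering: start with singleton clusters and keep
    merging the closest pair until k clusters remain."""
    clusters = [[p] for p in points]

    def dist(a, b):
        ds = [sum((u - v) ** 2 for u, v in zip(p, q)) ** 0.5
              for p in a for q in b]
        return sum(ds) / len(ds) if linkage == "average" else min(ds)

    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)  # merge the closest pair (j > i)
    return clusters

# Three visibly separated toy blobs recover as three clusters of 2.
pts = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]
groups = agglomerative(pts, 3)
print(sorted(len(g) for g in groups))  # -> [2, 2, 2]
```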

K-means approach

First, we need to choose the number of groups K before performing the K-means approach.
Several suggestions have been made as to how to choose the number of groups.

The simplest way to select K here is to reuse the solution from hierarchical clustering: 3 groups.

The labels in the plot show the result of K-means.
As a result, the small overlap between the three clusters and the original class.id indicates that K-means suits this data well.
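The partitioning step itself is Lloyd's algorithm: alternate between assigning points to the nearest centroid and recomputing each centroid as its cluster mean. A minimal Python sketch with a fixed (hypothetical) initialization on toy data:

```python
def kmeans(points, centroids, iters=20):
    """Lloyd's algorithm: alternate (1) assign each point to its nearest
    centroid and (2) move each centroid to its cluster mean. Real k-means
    would use random initial centroids with several restarts."""
    k = len(centroids)
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[j])))
            groups[j].append(p)
        centroids = [tuple(sum(c) / len(g) for c in zip(*g)) if g
                     else centroids[j]   # keep an empty cluster's centroid
                     for j, g in enumerate(groups)]
    return centroids, groups

pts = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]
# Hypothetical fixed initialization: one seed point per visible blob.
centroids, groups = kmeans(pts, [pts[0], pts[2], pts[4]])
print(sorted(len(g) for g in groups))  # -> [2, 2, 2]
```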


In the previous section (EDA) we found some outliers in this data,
so Partitioning Around Medoids (PAM), a more robust method, is worth trying.

An average silhouette width of 0.57 indicates a reasonable structure.
Namely, PAM is not bad, but not excellent!
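The average silhouette width behind the 0.57 figure is computed as follows: for each point, a is its mean distance to its own cluster, b its mean distance to the nearest other cluster, and s = (b − a) / max(a, b). A Python sketch on toy clusters (the values here are illustrative, not the wine data):

```python
def silhouette_avg(clusters):
    """Average silhouette width: for each point, a = mean distance to its
    own cluster, b = mean distance to the nearest other cluster, and
    s = (b - a) / max(a, b)."""
    def d(p, q):
        return sum((u - v) ** 2 for u, v in zip(p, q)) ** 0.5

    widths = []
    for ci, cl in enumerate(clusters):
        for p in cl:
            if len(cl) == 1:
                widths.append(0.0)  # convention for singleton clusters
                continue
            a = sum(d(p, q) for q in cl if q is not p) / (len(cl) - 1)
            b = min(sum(d(p, q) for q in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci)
            widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)

# Tight, well-separated toy clusters score high; 0.57 (as for PAM above)
# falls in the lower "reasonable structure" band.
clusters = [[(0, 0), (0, 1)], [(5, 5), (5, 6)], [(10, 0), (10, 1)]]
print(round(silhouette_avg(clusters), 2))  # -> 0.86
```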

SOM approach

By the nature of SOM, the closer the colors, the more similar the units; from the plot, the data can be divided into three groups.

  [1] 1 1 2 2 2 2 2 2 3 3 4 1 1 2 2 2 2 2 3 3 4 1 1 1 2 2 2 2 3 3 4 4 1 1 1
 [36] 2 2 2 3 3 4 4 4 1 1 2 2 2 2 2 4 4 4 4 1 1 1 2 2 2 4 4 4 4 1 1 1 2 2 2
 [71] 5 5 4 4 4 1 1 1 2 2 5 5 4 4 4 1 1 1 2 2 5 5 5 4 4 1 1 1 1 2
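The SOM property used above (nearby units end up with similar weights, hence similar colors) comes from updating not only the best-matching unit but also its grid neighbours. A minimal 1-D Python sketch (grid size, learning rate, and data are illustrative assumptions; the real analysis maps 13-dimensional wines onto a 2-D grid):

```python
import random

def train_som(data, n_units=5, epochs=50, lr=0.5, radius=1, seed=0):
    """Minimal 1-D SOM for scalar data: each grid unit holds one weight;
    the best-matching unit (BMU) and its grid neighbours are pulled
    toward every sample, so adjacent units become similar."""
    rng = random.Random(seed)
    w = [rng.uniform(min(data), max(data)) for _ in range(n_units)]
    for t in range(epochs):
        alpha = lr * (1 - t / epochs)  # decaying learning rate
        for x in data:
            bmu = min(range(n_units), key=lambda i: abs(w[i] - x))
            for i in range(n_units):
                if abs(i - bmu) <= radius:  # update BMU and its neighbours
                    w[i] += alpha * (x - w[i])
    return w

data = [0.0, 0.1, 5.0, 5.1, 10.0, 10.1]
weights = train_som(data)
print([round(v, 1) for v in weights])
```

After training, each observation is assigned to its BMU, which is how the unit indices listed above arise.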